Finding Frequent Structural Features among Words in Tree-Structured Documents
Internal identifier: 000998 (Main/Exploration); previous: 000997; next: 000999
Authors: Tomoyuki Uchida [Japan]; Tomonori Mogawa [Japan]; Yasuaki Nakamura [Japan]
Source:
- Lecture Notes in Computer Science [ 0302-9743 ]
French descriptors
- Pascal (Inist)
- Wicri:
- topic: Document électronique (electronic document).
English descriptors
- KwdEn :
Abstract
Many electronic documents such as SGML/HTML/XML files and LaTeX files have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large plain texts. In order to extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in tree-structured documents. Let $k \geq 2$ be an integer and $(W_1, W_2, \ldots, W_k)$ a list of words which are sorted in lexicographical order. A consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ is a sequence $\langle t_1, t_2, \ldots, t_{k-1} \rangle$ of labeled rooted ordered trees such that, for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the pair $(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are one and which are labeled with $W_i$ and $W_{i+1}$, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. Then, by reporting experimental results on our algorithm, we show that our algorithm is efficient for extracting structural features from tree-structured documents.
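As a reading aid, the definition in the abstract can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' algorithm: the names `Node`, `matches_pair`, and `is_consecutive_path_pattern` are hypothetical, labels are encoded as strings or 2-tuples, lexicographical order is taken to be Python's default string order, and in case (2) either node may carry either word, since the abstract does not say which. The frequency-counting step of the mining algorithm is not described in the abstract and is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple, Union

# A label is either a single word or a pair of words (assumed encoding).
Label = Union[str, Tuple[str, str]]


@dataclass
class Node:
    """A node of a labeled rooted ordered tree (hypothetical encoding)."""
    label: Label
    children: List["Node"] = field(default_factory=list)


def matches_pair(tree: Node, w_i: str, w_next: str) -> bool:
    """Check conditions (1) and (2) of the abstract for one tree t_i."""
    # Case (1): a single node labeled with the pair (W_i, W_{i+1}).
    if not tree.children:
        return tree.label == (w_i, w_next)
    # Case (2): exactly two nodes joined by one edge, labeled W_i and W_{i+1}.
    # Assumption: either orientation (root/child) is accepted.
    if len(tree.children) == 1 and not tree.children[0].children:
        pair = (tree.label, tree.children[0].label)
        return pair == (w_i, w_next) or pair == (w_next, w_i)
    return False


def is_consecutive_path_pattern(words: Sequence[str], trees: Sequence[Node]) -> bool:
    """Return True if `trees` is a consecutive path pattern on `words`."""
    k = len(words)
    # The definition requires k >= 2 and exactly k - 1 trees.
    if k < 2 or len(trees) != k - 1:
        return False
    # The word list must be sorted in lexicographical order.
    if list(words) != sorted(words):
        return False
    return all(
        matches_pair(t, words[i], words[i + 1]) for i, t in enumerate(trees)
    )


words = ["data", "mining", "tree"]
pattern = [
    Node(("data", "mining")),          # case (1): one node, pair label
    Node("mining", [Node("tree")]),    # case (2): two nodes, one edge
]
valid = is_consecutive_path_pattern(words, pattern)
```

Here `valid` holds the result of checking a three-word list against a two-tree sequence that uses one tree of each kind.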
DOI: 10.1007/978-3-540-24775-3_43
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 001080
- to stream Istex, to step Curation: 000D30
- to stream Istex, to step Checkpoint: 000927
- to stream Main, to step Merge: 000A07
- to stream PascalFrancis, to step Corpus: 000033
- to stream PascalFrancis, to step Curation: 000149
- to stream PascalFrancis, to step Checkpoint: 000020
- to stream Main, to step Merge: 000B09
- to stream Main, to step Curation: 000998
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Finding Frequent Structural Features among Words in Tree-Structured Documents</title>
<author><name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
</author>
<author><name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
</author>
<author><name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:40B3A82782E1E9F30CED10BDF201F393113E4EB9</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-24775-3_43</idno>
<idno type="url">https://api.istex.fr/ark:/67375/HCB-3GZ3WVK7-F/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001080</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001080</idno>
<idno type="wicri:Area/Istex/Curation">000D30</idno>
<idno type="wicri:Area/Istex/Checkpoint">000927</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000927</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Uchida T:finding:frequent:structural</idno>
<idno type="wicri:Area/Main/Merge">000A07</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:04-0300385</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000033</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000149</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000020</idno>
<idno type="wicri:explorRef" wicri:stream="PascalFrancis" wicri:step="Checkpoint">000020</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Uchida T:finding:frequent:structural</idno>
<idno type="wicri:Area/Main/Merge">000B09</idno>
<idno type="wicri:Area/Main/Curation">000998</idno>
<idno type="wicri:Area/Main/Exploration">000998</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Finding Frequent Structural Features among Words in Tree-Structured Documents</title>
<author><name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Information Sciences, Hiroshima City University, 731-3194, Hiroshima</wicri:regionArea>
<wicri:noRegion>Hiroshima</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Department of Computer and Media Technologies, Hiroshima City University, 731-3194, Hiroshima</wicri:regionArea>
<wicri:noRegion>Hiroshima</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Information Sciences, Hiroshima City University, 731-3194, Hiroshima</wicri:regionArea>
<wicri:noRegion>Hiroshima</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s" type="main" xml:lang="en">Lecture Notes in Computer Science</title>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Data analysis</term>
<term>Data mining</term>
<term>Electronic document</term>
<term>File structure</term>
<term>HTML language</term>
<term>Information extraction</term>
<term>Latex</term>
<term>SGML language</term>
<term>Text</term>
<term>Tree structured method</term>
<term>Word</term>
<term>XML language</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Analyse donnée</term>
<term>Document électronique</term>
<term>Extraction information</term>
<term>Fouille donnée</term>
<term>Langage HTML</term>
<term>Langage SGML</term>
<term>Langage XML</term>
<term>Latex</term>
<term>Mot</term>
<term>Méthode arborescente</term>
<term>Structure fichier</term>
<term>Texte</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Document électronique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Many electronic documents such as SGML/HTML/XML files and LaTeX files have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large plain texts. In order to extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in tree-structured documents. Let $k \geq 2$ be an integer and $(W_1, W_2, \ldots, W_k)$ a list of words which are sorted in lexicographical order. A consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ is a sequence $\langle t_1, t_2, \ldots, t_{k-1} \rangle$ of labeled rooted ordered trees such that, for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the pair $(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are one and which are labeled with $W_i$ and $W_{i+1}$, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. Then, by reporting experimental results on our algorithm, we show that our algorithm is efficient for extracting structural features from tree-structured documents.</div>
</front>
</TEI>
<affiliations><list><country><li>Japon</li>
</country>
</list>
<tree><country name="Japon"><noRegion><name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
</noRegion>
<name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
<name sortKey="Mogawa, Tomonori" sort="Mogawa, Tomonori" uniqKey="Mogawa T" first="Tomonori" last="Mogawa">Tomonori Mogawa</name>
<name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
<name sortKey="Nakamura, Yasuaki" sort="Nakamura, Yasuaki" uniqKey="Nakamura Y" first="Yasuaki" last="Nakamura">Yasuaki Nakamura</name>
<name sortKey="Uchida, Tomoyuki" sort="Uchida, Tomoyuki" uniqKey="Uchida T" first="Tomoyuki" last="Uchida">Tomoyuki Uchida</name>
</country>
</tree>
</affiliations>
</record>
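The identifiers in the record above can also be read programmatically. The following Python sketch uses the standard-library `xml.etree.ElementTree` on an abbreviated copy of the record. The `xmlns:wicri` declaration and its placeholder URI are assumptions added so that the parser accepts the `wicri:` prefixed attribute; the record as shown does not declare that prefix. The helper names `read_idnos` and `read_title` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the record. The xmlns:wicri URI is a placeholder
# (assumption); the original record does not declare the wicri prefix.
SAMPLE = """<record xmlns:wicri="http://example.org/wicri">
<TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc>
<titleStmt><title xml:lang="en">Finding Frequent Structural Features among Words in Tree-Structured Documents</title></titleStmt>
<publicationStmt>
<idno type="doi">10.1007/978-3-540-24775-3_43</idno>
<idno type="wicri:Area/Main/Exploration">000998</idno>
</publicationStmt>
</fileDesc></teiHeader></TEI>
</record>"""


def read_idnos(xml_text: str) -> dict:
    """Map each <idno> type attribute to its text content."""
    root = ET.fromstring(xml_text)
    # If a type occurs more than once, the last occurrence wins.
    return {e.get("type"): (e.text or "").strip() for e in root.iter("idno")}


def read_title(xml_text: str):
    """Return the title from <titleStmt>, or None if it is absent."""
    root = ET.fromstring(xml_text)
    title = root.find(".//titleStmt/title")
    return title.text if title is not None else None


idnos = read_idnos(SAMPLE)
title = read_title(SAMPLE)
```

Because the full record repeats some `type` values (for example `RBID` appears for both ISTEX and INIST), a dictionary keeps only the last one; collecting values into lists per type would preserve all of them.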
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Informatique/explor/SgmlV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000998 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000998 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Informatique |area= SgmlV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:40B3A82782E1E9F30CED10BDF201F393113E4EB9 |texte= Finding Frequent Structural Features among Words in Tree-Structured Documents }}
This area was generated with Dilib version V0.6.33.